feat(bench): flip sandbox-e (schema compression) to ACTIVE — first new ACTIVE since vllm-q4-llama8b#51
Merged
Resolves all 3 blocked_on items the original INACTIVE stub listed,
without needing the full MCP-multiturn-with-model harness:
- workload curated: bench/workloads/mcp-tool-defs-30.jsonl (30
  representative MCP tool defs across 6 categories: filesystem,
  web, code, calendar, email, system)
- bench.py: applies canonical schema compression (strip
  descriptions, shorten param names, hide optional params),
  counts tokens before/after via cl100k_base (with deterministic
  char-div-4 fallback), reports median pct reduction
- docker-compose.yml: minimal python:3.11-slim container with
  tiktoken installed; reads workload from /workloads/, writes
  outputs.json
- expected.json: status flipped ACTIVE; secondary metric (tool-call
  accuracy delta) explicitly removed and tracked as a future
  paired model-dependent sandbox
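The measurement described for bench.py can be sketched roughly as follows. This is a minimal illustration, not the actual bench.py: it assumes each fixture record follows the MCP tool shape (`name`, `description`, `inputSchema`), and the helper names (`compress_tool`, `char_div_4`, `median_pct_reduction`) and the `p0`, `p1`, ... renaming scheme are hypothetical.

```python
import json
import statistics


def char_div_4(text: str) -> int:
    """Deterministic fallback: roughly one token per four characters."""
    return max(1, len(text) // 4)


def get_token_counter():
    """Return a cl100k_base counter if tiktoken is usable, else the fallback."""
    try:
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        return lambda text: len(enc.encode(text))
    except Exception:
        return char_div_4


def compress_tool(tool: dict) -> dict:
    """Strip descriptions, hide optional params, shorten param names.

    Assumption: optional params are those not listed under "required".
    """
    schema = tool.get("inputSchema", {})
    props = schema.get("properties", {})
    kept = [name for name in schema.get("required", []) if name in props]
    short = {name: f"p{i}" for i, name in enumerate(kept)}  # hypothetical renaming
    return {
        "name": tool["name"],
        "inputSchema": {
            "type": "object",
            "properties": {
                short[n]: {"type": props[n].get("type", "string")} for n in kept
            },
            "required": [short[n] for n in kept],
        },
    }


def pct_reduction(tool: dict, count) -> float:
    """Percent token reduction for one tool definition."""
    before = count(json.dumps(tool, sort_keys=True))
    after = count(json.dumps(compress_tool(tool), sort_keys=True))
    return 100.0 * (before - after) / before


def median_pct_reduction(tools: list, count) -> float:
    """Median of the per-tool percent reductions."""
    return statistics.median(pct_reduction(t, count) for t in tools)
```

Passing the counter in as an argument keeps the result reproducible: the same function runs with cl100k_base when tiktoken is available and with the char-div-4 fallback otherwise.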
Local end-to-end measurement (no Docker, direct python bench.py):
primary_value: 70.12% median reduction (cl100k_base tokenizer)
threshold: confirm_at_least=30%
verdict: CONFIRMED — well above the 30% bar
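The verdict is a plain threshold comparison, which could look like the sketch below. The function name and the `NOT_CONFIRMED` label are assumptions; only `confirm_at_least` and `CONFIRMED` appear in this PR.

```python
def verdict(primary_value_pct: float, confirm_at_least_pct: float) -> str:
    """Compare the measured median reduction against the confirmation bar."""
    if primary_value_pct >= confirm_at_least_pct:
        return "CONFIRMED"
    return "NOT_CONFIRMED"  # hypothetical label for the failing case


print(verdict(70.12, 30.0))  # prints CONFIRMED
```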
Also locked in this PR:
- .gitignore: bench/isolation/**/outputs.json (per-run artifact,
  not source of truth; bench/results/ holds the canonical summaries)
- generator script for the workload (deterministic: re-running
  produces identical output)
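One way to get a generator whose re-runs produce identical output is a locally seeded RNG plus stable JSON key ordering. The sketch below is illustrative only: the even split of 5 tools per category, the seed, and all tool and parameter names are assumptions, not the contents of the real fixture.

```python
import json
import random

CATEGORIES = ["filesystem", "web", "code", "calendar", "email", "system"]


def generate_tools(seed: int = 0, per_category: int = 5) -> list[dict]:
    """Build placeholder MCP-style tool defs from a fixed seed."""
    rng = random.Random(seed)  # local seeded RNG, independent of global state
    tools = []
    for category in CATEGORIES:
        for i in range(per_category):
            params = [f"{category}_arg_{j}" for j in range(rng.randint(1, 4))]
            tools.append({
                "name": f"{category}_tool_{i}",
                "description": f"Hypothetical {category} tool number {i}.",
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        p: {"type": "string", "description": f"Placeholder for {p}."}
                        for p in params
                    },
                    "required": params[:1],
                },
            })
    return tools


def to_jsonl(tools: list[dict]) -> str:
    """Serialize one tool per line with sorted keys for byte-stable output."""
    return "".join(json.dumps(t, sort_keys=True) + "\n" for t in tools)
```

Comparing `to_jsonl(generate_tools())` across two runs is a cheap check that the fixture has not drifted.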
Net effect: bench framework now has 2 ACTIVE sandboxes (vllm-q4-llama8b
+ sandbox-e), 12 INACTIVE. dry-run-all reports cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
First sandbox to flip from INACTIVE to ACTIVE since the framework shipped. Sandbox E (schema compression) measures the input-token reduction from OCM's canonical MCP-tool compression recipe, with no model invocation needed for the primary metric — pure deterministic measurement.
Local validation
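Ran end-to-end on the actual workload via direct `python bench.py` (no Docker): 70.12% median reduction (cl100k_base tokenizer) against a `confirm_at_least` threshold of 30%, so the verdict is CONFIRMED.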
Spec impact
Spec v0.2 row 21 claimed a 30-60% reduction. The measured value is 70.12%, which exceeds the upper bound. Worth a follow-up note in spec hygiene: the recipe is more aggressive than originally claimed, so secondary accuracy validation becomes proportionally more important once the model-dependent harness lands.
Frame
First new ACTIVE flip = ~700 lines of work (workload generator + 30-tool fixture + bench.py + compose + expected.json refit). Sets the recipe for the other 12 INACTIVE stubs as their `blocked_on` items resolve.
What this changes
🤖 Generated with Claude Code